INTRODUCTION:
Weather is important to everyone. Every time when we go out, we will ask questions about weather: What's the temperature outside? Is it going to rain? And that's why we choose to do data exploration on weather.
Historical Hourly Weather Data:
The dataset contains 5 years of hourly measured data of various weather attributes, such as temperature, humidity, pressure, etc. (more than 40,000 rows) This data is available for 30 US and Canadian Cities, as well as 6 Israeli cities.
URL:http://www.kaggle.com/selfishgene/historical-hourly-weather-data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyhive import hive
conn = hive.connect("quickstart.cloudera", username="cloudera")
cursor = conn.cursor()
cursor.execute('DROP DATABASE IF EXISTS weather CASCADE') # if one was created before
cursor.execute('CREATE DATABASE weather')
cursor.execute("CREATE TABLE IF NOT EXISTS humidity (\
datetime STRING,\
Vancouver INT,Portland INT,SanFrancisco INT,Seattle INT,\
LosAngeles INT,SanDiego INT,LasVegas INT,Phoenix INT,\
Albuquerque INT,Denver INT,SanAntonio INT,Dallas INT,\
Houston INT,KansasCity INT,Minneapolis INT,SaintLouis INT,\
Chicago INT,Nashville INT,Indianapolis INT,Atlanta INT,\
Detroit INT,Jacksonville INT,Charlotte INT,Miami INT,\
Pittsburgh INT,Toronto INT,Philadelphia INT,NewYork INT,\
Montreal INT,Boston INT,Beersheba INT,TelAvivDistrict INT,\
Eilat INT,Haifa INT,Nahariyya INT,Jerusalem INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/humidity.csv' OVERWRITE INTO TABLE humidity")
cursor.execute('ALTER TABLE humidity set \
TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line
cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables
cursor.execute('SELECT * FROM humidity')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto INT','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
cursor.execute("CREATE TABLE IF NOT EXISTS pressure (\
datetime STRING,\
Vancouver INT,\
Portland INT,\
SanFrancisco INT,\
Seattle INT,\
LosAngeles INT,\
SanDiego INT,\
LasVegas INT,\
Phoenix INT,\
Albuquerque INT,\
Denver INT,\
SanAntonio INT,\
Dallas INT,\
Houston INT,\
KansasCity INT,\
Minneapolis INT,\
SaintLouis INT,\
Chicago INT,\
Nashville INT,\
Indianapolis INT,\
Atlanta INT,\
Detroit INT,\
Jacksonville INT,\
Charlotte INT,\
Miami INT,\
Pittsburgh INT,\
Toronto INT,\
Philadelphia INT,\
NewYork INT,\
Montreal INT,\
Boston INT,\
Beersheba INT,\
TelAvivDistrict INT,\
Eilat INT,\
Haifa INT,\
Nahariyya INT,\
Jerusalem INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/pressure.csv' OVERWRITE INTO TABLE pressure")
cursor.execute('ALTER TABLE pressure set \
TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line
cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables
cursor.execute('SELECT * FROM pressure')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto INT','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
cursor.execute("CREATE TABLE IF NOT EXISTS temperature (\
datetime STRING,\
Vancouver FLOAT,\
Portland FLOAT,\
SanFrancisco FLOAT,\
Seattle FLOAT,\
LosAngeles FLOAT,\
SanDiego FLOAT,\
LasVegas FLOAT,\
Phoenix FLOAT,\
Albuquerque FLOAT,\
Denver FLOAT,\
SanAntonio FLOAT,\
Dallas FLOAT,\
Houston FLOAT,\
KansasCity FLOAT,\
Minneapolis FLOAT,\
SaintLouis FLOAT,\
Chicago FLOAT,\
Nashville FLOAT,\
Indianapolis FLOAT,\
Atlanta FLOAT,\
Detroit FLOAT,\
Jacksonville FLOAT,\
Charlotte FLOAT,\
Miami FLOAT,\
Pittsburgh FLOAT,\
Toronto FLOAT,\
Philadelphia FLOAT,\
NewYork FLOAT,\
Montreal FLOAT,\
Boston FLOAT,\
Beersheba FLOAT,\
TelAvivDistrict FLOAT,\
Eilat FLOAT,\
Haifa FLOAT,\
Nahariyya FLOAT,\
Jerusalem FLOAT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/temperature.csv' OVERWRITE INTO TABLE temperature")
cursor.execute('ALTER TABLE temperature set \
TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line
cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables
cursor.execute('SELECT * FROM temperature')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto INT','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
cursor.execute("CREATE TABLE IF NOT EXISTS description (\
datetime STRING,\
Vancouver STRING,\
Portland STRING,\
SanFrancisco STRING,\
Seattle STRING,\
LosAngeles STRING,\
SanDiego STRING,\
LasVegas STRING,\
Phoenix STRING,\
Albuquerque STRING,\
Denver STRING,\
SanAntonio STRING,\
Dallas STRING,\
Houston STRING,\
KansasCity STRING,\
Minneapolis STRING,\
SaintLouis STRING,\
Chicago STRING,\
Nashville STRING,\
Indianapolis STRING,\
Atlanta STRING,\
Detroit STRING,\
Jacksonville STRING,\
Charlotte STRING,\
Miami STRING,\
Pittsburgh STRING,\
Toronto STRING,\
Philadelphia STRING,\
NewYork STRING,\
Montreal STRING,\
Boston STRING,\
Beersheba STRING,\
TelAvivDistrict STRING,\
Eilat STRING,\
Haifa STRING,\
Nahariyya STRING,\
Jerusalem STRING)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/weather_description.csv' OVERWRITE INTO TABLE description")
cursor.execute('ALTER TABLE description set \
TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line
cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables
cursor.execute('SELECT * FROM description')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto INT','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
cursor.execute("CREATE TABLE IF NOT EXISTS windspeed (\
datetime STRING,\
Vancouver INT,\
Portland INT,\
SanFrancisco INT,\
Seattle INT,\
LosAngeles INT,\
SanDiego INT,\
LasVegas INT,\
Phoenix INT,\
Albuquerque INT,\
Denver INT,\
SanAntonio INT,\
Dallas INT,\
Houston INT,\
KansasCity INT,\
Minneapolis INT,\
SaintLouis INT,\
Chicago INT,\
Nashville INT,\
Indianapolis INT,\
Atlanta INT,\
Detroit INT,\
Jacksonville INT,\
Charlotte INT,\
Miami INT,\
Pittsburgh INT,\
Toronto INT,\
Philadelphia INT,\
NewYork INT,\
Montreal INT,\
Boston INT,\
Beersheba INT,\
TelAvivDistrict INT,\
Eilat INT,\
Haifa INT,\
Nahariyya INT,\
Jerusalem INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("LOAD DATA LOCAL INPATH '/home/cloudera/wind_speed.csv' OVERWRITE INTO TABLE windspeed")
cursor.execute('ALTER TABLE windspeed set \
TBLPROPERTIES ("skip.header.line.count"="1")') #remove first null line
cursor.execute('SET hive.cli.print.header=true') #To print the name of the variables
cursor.execute('SELECT * FROM windspeed')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['datetime','Vancouver','Portland','SanFrancisco','Seattle','LosAngeles','SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio','Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago','Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte','Miami','Pittsburgh','Toronto INT','Philadelphia','NewYork','Montreal','Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']))
rows.head()
cursor.execute('DROP TABLE IF EXISTS Boston') # if one was created before
cursor.execute("CREATE TABLE IF NOT EXISTS Boston (\
datetime STRING,humidity INT,pressure INT)\
ROW FORMAT DELIMITED \
FIELDS TERMINATED BY ','")
cursor.execute("INSERT INTO TABLE boston SELECT a.datetime, a.Boston, b.Boston FROM humidity a JOIN pressure b ON (a.datetime=b.datetime)")
cursor.execute("select * from boston")
pd.DataFrame(cursor.fetchall())
cursor.execute("ALTER TABLE Boston ADD COLUMNS (temperature double)")
cursor.execute("INSERT OVERWRITE TABLE Boston SELECT a.datetime, a.humidity, a.pressure,b.Boston FROM boston a JOIN temperature b ON (a.datetime=b.datetime)")
cursor.execute("ALTER TABLE Boston ADD COLUMNS (windspeed int)")
cursor.execute("INSERT OVERWRITE TABLE Boston SELECT a.datetime, a.humidity, a.pressure,a.temperature,b.Boston FROM boston a JOIN windspeed b ON (a.datetime=b.datetime)")
cursor.execute("ALTER TABLE Boston ADD COLUMNS (description string)")
cursor.execute("INSERT OVERWRITE TABLE Boston SELECT a.datetime, a.humidity, a.pressure,a.temperature,a.windspeed,b.Boston FROM boston a JOIN description b ON (a.datetime=b.datetime)")
cursor.execute("select * from boston")
pd.DataFrame(cursor.fetchall())
cursor.execute("select * from boston")
boston=pd.DataFrame(cursor.fetchall())
boston.columns=['datetime','humidity','pressure','temperature','windspeed','description']
boston
cursor.execute('SELECT description,COUNT(description) FROM Boston GROUP BY description')
query = cursor.fetchall()
rows = pd.DataFrame(query, columns=(['description', 'Count description']))
rows.sort_values(by="Count description",inplace=True,ascending=False)
plt.figure(figsize=(18,9))
plt.bar(rows["description"],rows["Count description"])
plt.xticks(rotation=70,fontsize=12)
print('Repartition for weather by description')
sky is clear is the most frequent weather condition in Boston, followed by light rain and mist.
category_pie = boston['description'].value_counts().sort_values(ascending=True)
category_pie.plot.pie(figsize=(10, 10),autopct='%.2f')
cursor.execute("select description,AVG(humidity) from Boston GROUP BY description")
query=cursor.fetchall()
df1=pd.DataFrame(query)
df1.sort_values(1,inplace=True)
df1.columns=['weather_description','AVG_humidity']
df1.plot(x='weather_description',y='AVG_humidity',kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=True)
From this bar chart we can see that normally rain,snow,fog weather have a high humidity.
cursor.execute("select description,AVG(pressure) from Boston GROUP BY description")
query=cursor.fetchall()
df2=pd.DataFrame(query)
df2.sort_values(1,inplace=True)
df2.columns=['weather_description','AVG_pressure']
df2.plot(x='weather_description',y='AVG_pressure',kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=True)
cursor.execute("select description,AVG(temperature) from Boston GROUP BY description")
query=cursor.fetchall()
df3=pd.DataFrame(query)
df3.sort_values(1,inplace=True)
df3.columns=['weather_description','AVG_temperature']
df3.plot(x='weather_description',y='AVG_temperature',kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=True)
this chart is interesting, we can find three seperate stages. First, the lowest is snow weather in winter, of course the average temperature is relatively low; The highest, thounderstorm weather which usually happens in hot summer. As as for these bars in the middle, they don't have such clear season preference, so they are just at the average stage.
d1=df1.copy()
d2=df2.copy()
d3=df3.copy()
d1["AVG_humidity"]=(d1["AVG_humidity"]-d1["AVG_humidity"].min())/(d1["AVG_humidity"].max()-d1["AVG_humidity"].min())
d2["AVG_pressure"]=(d2["AVG_pressure"]-d2["AVG_pressure"].min())/(d2["AVG_pressure"].max()-d2["AVG_pressure"].min())
d3["AVG_temperature"]=(d3["AVG_temperature"]-d3["AVG_temperature"].min())/(d3["AVG_temperature"].max()-d3["AVG_temperature"].min())
d1=d1.merge(d2,on="weather_description")
d1=d1.merge(d3,on="weather_description")
d1.plot(x='weather_description',y=['AVG_humidity','AVG_pressure','AVG_temperature'],
kind='bar',rot=270,figsize=(15,7),fontsize=18,logy=False)
#clean data
bst=boston.copy()
bst.dropna(axis=0,how='any',inplace=True) # delete rows including empty elements
bst['datetime']=pd.to_datetime(bst['datetime'])
bst=bst.set_index('datetime')
bst.head(20)
bst['humidity'] = pd.to_numeric(bst['humidity'])
day_temperature=bst['humidity'].groupby([bst.index.year,bst.index.month])
day_temperature.mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),day_temperature.mean())
plt.title("Humidity")
There is a unique peak in 2015, related to what we just learned from the humidity group by weather description, high humidity means high probablity of rain. Let's go further
cursor.execute("select * from Boston where description='light rain' or description='moderate rain' ")
light_rain=pd.DataFrame(cursor.fetchall())
light_rain.columns=["datetime","humidity","temperature","winddirection","windspeed","description"]
light_rain.dropna(axis=0,how='any',inplace=True)
light_rain['datetime']=pd.to_datetime(light_rain['datetime'])
light_rain=light_rain.set_index('datetime')
pd.DataFrame(light_rain.groupby(light_rain.index.year).count()["description"])
we can see that 2014 and 2015 has the most of rain records
bst['pressure'] = pd.to_numeric(bst['pressure'])
day_temperature=bst['pressure'].groupby([bst.index.year,bst.index.month])
day_temperature.mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),day_temperature.mean())
plt.title("Pressure")
cursor.execute("select * from Boston where description='snow' or description='heavy snow' ")
light_rain=pd.DataFrame(cursor.fetchall())
light_rain.columns=["datetime","humidity","temperature","winddirection","windspeed","description"]
light_rain.dropna(axis=0,how='any',inplace=True)
light_rain['datetime']=pd.to_datetime(light_rain['datetime'])
light_rain=light_rain.set_index('datetime')
len(light_rain)
light_rain.index.month
pd.DataFrame(light_rain.groupby(light_rain.index.year).count()["description"])
bst['windspeed'] = pd.to_numeric(bst['windspeed'])
day_temperature=bst['windspeed'].groupby([bst.index.year,bst.index.month])
day_temperature.mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),day_temperature.mean())
plt.title("Windspeed")
bst['temperature'] = pd.to_numeric(bst['temperature'])
day_temperature=bst['temperature'].groupby([bst.index.year,bst.index.month])
day_temperature.mean()
plt.figure(figsize=(15,7))
plt.plot(pd.date_range('10/2012','12/2017',freq='M'),day_temperature.mean())
plt.title("Temperature")
correlations=bst.corr()
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,4,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names=['humidity','pressure','temperature','wind_speed']
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
bst.corr()
x = boston['humidity']
y = boston['pressure']
z = boston['temperature']
s = boston['windspeed']
t = boston['description']
p = sns.pairplot(pd.DataFrame(list(zip(x, y,z, s,t)),
columns=['humidty', 'pressure', 'temperature', 'windspeed','description']), palette="Set1")
bst['description'].value_counts()
import sys
rain_forecast=bst.copy()
rain_forecast.replace({"sky is clear":0,"broken clouds":0,"mist":0,"scattered clouds":0,"overcast clouds":0,
"few clouds":0,'fog':0,'haze':0,'dust':0,'smoke':0,"squalls":0,"sand/dust whirls":0,
"sand":0,},inplace=True)
for i,j in enumerate(rain_forecast['description']):
if (j!=0):
rain_forecast.ix[i,'description']=1
rain_forecast['description'].value_counts()
rain_forecast
correlations=rain_forecast.corr()
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,5,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names=['humidity','pressure','temperature','wind_speed','rain_or_not']
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
rain_forecast.corr()
rain_forecast.rename(columns={"description":"rain_or_not"}, inplace=True)
rain_forecast
boston_mean = rain_forecast .resample('1D').mean() # min() or max()
boston_mean.rename(columns={"humidity":"meanhum", "pressure":"meanpres", "temperature":"meantemp","windspeed":"meanwsp"}, inplace=True)
boston_mean.drop(columns='rain_or_not',inplace=True)
boston_mean.head()
boston_min = rain_forecast .resample('1D').min() # min() or max()
boston_min.rename(columns={"humidity":"minhum", "pressure":"minpres", "temperature":"mintemp","windspeed":"minwsp"}, inplace=True)
boston_min.drop(columns='rain_or_not',inplace=True)
boston_min.head()
boston_max = rain_forecast.resample('1D').max()
boston_max.rename(columns={"humidity":"maxhum", "pressure":"maxpres", "temperature":"maxtemp","windspeed":"maxwsp"}, inplace=True)
boston_max.head()
b=pd.merge(boston_mean,boston_min,how='left',on='datetime')
b=pd.merge(b,boston_max,how='left',on='datetime')
b
b= b.round(2)
bb=b.copy()
bb
bb=bb.reset_index(drop=True)
tomorrow_meantemp=bb[["meantemp"]]
tomorrow_meantemp.drop(0,inplace=True)
tomorrow_meantemp.reset_index(drop=True,inplace=True)
tomorrow_meantemp.rename(columns={"meantemp":"tomor_meantemp"}, inplace=True)
tomorrow_meantemp
bb.drop(1886,inplace=True)
bb=pd.merge(bb,tomorrow_meantemp,how='left',left_index=True, right_index=True)
bb
from sklearn.metrics import explained_variance_score, \
mean_absolute_error, \
median_absolute_error
from sklearn.model_selection import train_test_split
features=["meanhum","minhum","maxhum","meanpres","minpres","maxpres","meanwsp","minwsp","maxwsp","meantemp","mintemp","maxtemp"]
X =bb[features]
y = bb[["tomor_meantemp"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)
X
# LinearRegression Model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
features=["meanhum","minhum","maxhum","meanpres","minpres","maxpres",
"meanwsp","minwsp","maxwsp","meantemp","mintemp","maxtemp"]
X =bb[features]
y = bb[["tomor_meantemp"]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
random_state=23)
model=LinearRegression()
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))
import matplotlib.pyplot as plt
plt.figure(figsize=(18,9))
plt.plot(list(range(943)),y_test,'b')
plt.plot(list(range(943)),model.predict(X_test),'r')
plt.title("Prediction test")
plt.legend( ('test', 'prediction'))
plt.show()
plt.scatter(y_test,model.predict(X_test))
bb.describe().T
rain_forecast
rain_forecast['temperature'] = pd.to_numeric(rain_forecast['temperature'])
rain_forecast['humidity'] = pd.to_numeric(rain_forecast['humidity'])
rain_forecast['pressure'] = pd.to_numeric(rain_forecast['pressure'])
rain_forecast['windspeed'] = pd.to_numeric(rain_forecast['windspeed'])
from sklearn.linear_model import Perceptron
X_train=rain_forecast[["humidity","pressure","windspeed","temperature"]][0:33609].values
y_train=rain_forecast["rain_or_not"][0:33609].values
X_test=rain_forecast[["humidity","pressure","windspeed","temperature"]][33610:43609].values
y_test=rain_forecast["rain_or_not"][33610:43609].values
y_train=list(y_train)
y_test=list(y_test)
clf = Perceptron(tol=1e-3, random_state=0)
clf.fit(X_train,y_train)
print('train dataset score:')
print(clf.score(X_train,y_train))
print('test dataset score:')
print(clf.score(X_test,y_test))
predict=clf.predict(X_test)
k=0
for i in zip(y_test,predict):
if list(i)[0]!=list(i)[1]:
k=k+1
print(k,len(y))
import plotly.graph_objects as go
cursor.execute("select * from temperature")
humidity=pd.DataFrame(cursor.fetchall(),columns=['datetime','Vancouver','Portland',
'SanFrancisco','Seattle','LosAngeles','SanDiego',
'LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
'Dallas','Houston','KansasCity','Minneapolis','SaintLouis',
'Chicago','Nashville','Indianapolis','Atlanta','Detroit',
'Jacksonville','Charlotte','Miami','Pittsburgh',
'Toronto INT','Philadelphia','NewYork','Montreal',
'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa',
'Nahariyya','Jerusalem'])
cursor.execute("select AVG(Vancouver),AVG(Portland),AVG(SanFrancisco),AVG(Seattle),AVG(LosAngeles)\
,AVG(SanDiego),AVG(LasVegas),AVG(Phoenix),AVG(Albuquerque),AVG(Denver),AVG(SanAntonio),AVG(Dallas)\
,AVG(Houston),AVG(KansasCity),AVG(Minneapolis),AVG(SaintLouis),AVG(Chicago),AVG(Nashville)\
,AVG(Indianapolis),AVG(Atlanta),AVG(Detroit),AVG(Jacksonville),AVG(Charlotte),AVG(Miami),AVG(Pittsburgh)\
,AVG(Toronto),AVG(Philadelphia),AVG(NewYork),AVG(Montreal),AVG(Boston),AVG(Beersheba),AVG(TelAvivDistrict)\
,AVG(Eilat),AVG(Haifa),AVG(Nahariyya),AVG(Jerusalem) from temperature")
avg_temp=cursor.fetchall()
avg_temp_c=np.array(avg_temp)-273.15
geo_avg_temp=pd.DataFrame(columns=["avg_temp","city_name","geo_code"])
geo_avg_temp["avg_temp"]=avg_temp_c.reshape(-1,)
geo_avg_temp["city_name"]=['Vancouver','Portland','SanFrancisco','Seattle','LosAngeles',
'SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
'Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago',
'Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte',
'Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal',
'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']
geo_avg_temp["geo_code"]=['Vancouver','Portland','CA','WA','LA','CA','NV',
'AZ','NM','CO','TX','TX','TX',
'MO','MN','MO','IL','TN','IN',
'GA','MI','FL','NC','FL','PA','ON',
'PA','NY','Montreal','MA','Beersheba','TelAvivDistrict','Eilat',
'Haifa','Nahariyya','Jerusalem']
geo_avg_temp
new=pd.DataFrame(geo_avg_temp.groupby("geo_code").mean())
new['avg_temp']
fig = go.Figure(data=go.Choropleth(
locations=new.index, # Spatial coordinates
z = new["avg_temp"], # Data to be color-coded
colorscale = 'Reds',
locationmode = 'USA-states',
colorbar_title = "temperature",
))
fig.update_layout(
title_text = 'Average temperature',
geo_scope='usa', # limite map scope to USA
)
fig.show()
cursor.execute("select * from humidity")
humidity=pd.DataFrame(cursor.fetchall(),columns=['datetime','Vancouver','Portland',
'SanFrancisco','Seattle','LosAngeles','SanDiego',
'LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
'Dallas','Houston','KansasCity','Minneapolis','SaintLouis',
'Chicago','Nashville','Indianapolis','Atlanta','Detroit',
'Jacksonville','Charlotte','Miami','Pittsburgh',
'Toronto INT','Philadelphia','NewYork','Montreal',
'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa',
'Nahariyya','Jerusalem'])
cursor.execute("select AVG(Vancouver),AVG(Portland),AVG(SanFrancisco),AVG(Seattle),AVG(LosAngeles)\
,AVG(SanDiego),AVG(LasVegas),AVG(Phoenix),AVG(Albuquerque),AVG(Denver),AVG(SanAntonio),AVG(Dallas)\
,AVG(Houston),AVG(KansasCity),AVG(Minneapolis),AVG(SaintLouis),AVG(Chicago),AVG(Nashville)\
,AVG(Indianapolis),AVG(Atlanta),AVG(Detroit),AVG(Jacksonville),AVG(Charlotte),AVG(Miami),AVG(Pittsburgh)\
,AVG(Toronto),AVG(Philadelphia),AVG(NewYork),AVG(Montreal),AVG(Boston),AVG(Beersheba),AVG(TelAvivDistrict)\
,AVG(Eilat),AVG(Haifa),AVG(Nahariyya),AVG(Jerusalem) from humidity")
avg_h=cursor.fetchall()
geo_avg_h=pd.DataFrame(columns=["avg_h","city_name","geo_code"])
geo_avg_h["avg_h"]=np.array(avg_h).reshape(-1,)
geo_avg_h["city_name"]=['Vancouver','Portland','SanFrancisco','Seattle','LosAngeles',
'SanDiego','LasVegas','Phoenix','Albuquerque','Denver','SanAntonio',
'Dallas','Houston','KansasCity','Minneapolis','SaintLouis','Chicago',
'Nashville','Indianapolis','Atlanta','Detroit','Jacksonville','Charlotte',
'Miami','Pittsburgh','Toronto','Philadelphia','NewYork','Montreal',
'Boston','Beersheba','TelAvivDistrict','Eilat','Haifa','Nahariyya','Jerusalem']
geo_avg_h["geo_code"]=['Vancouver','Portland','CA','WA','LA','CA','NV',
'AZ','NM','CO','TX','TX','TX',
'MO','MN','MO','IL','TN','IN',
'GA','MI','FL','NC','FL','PA','ON',
'PA','NY','Montreal','MA','Beersheba','TelAvivDistrict','Eilat',
'Haifa','Nahariyya','Jerusalem']
geo_avg_h
humidity.drop(columns="datetime").mean(axis=0)
new_h=pd.DataFrame(geo_avg_h.groupby("geo_code").mean())
new_h['avg_h']
fig = go.Figure(data=go.Choropleth(
locations=new_h.index, # Spatial coordinates
z = new_h["avg_h"], # Data to be color-coded
colorscale = 'Blues',
locationmode = 'USA-states',
colorbar_title = "humidity",
))
fig.update_layout(
title_text = 'Average humidity',
geo_scope='usa', # limite map scope to USA
)
fig.show()
The Hourly Dataset contains too much information. The exploration follows in these main parts.
1.We extract weather features from each table and insert into a joint table Boston to explore a specific city's weather situation, such as the most frequent weather, average weather features group by weather description, the curves of these weather features in time sequence and we found some interesting conclusion such as the relationship between weather conditions and these numerical features and the 2015 Boston Snowstorm event and so on.
2.we try to go furture in Boston dataset by using machine learning. We did temprature regression and rain prediction and the outcomes were satisfying.
3.we try to explore the potential of these data by using plotly library. We use hive to calculate the average of these numerical features like temperature of different cities and show their distribution on map